ABSTRACT

Despite the central importance of crew safety in 
designing and operating a life support system, the metric 
commonly used to evaluate alternative Advanced Life 
Support (ALS) technologies does not currently provide 
explicit techniques for measuring safety. The resilience 
of a system, or the system’s ability to meet performance 
requirements and recover from component-level faults, 
is fundamentally a dynamic property. This paper 
motivates the use of computer models as a tool to 
understand and improve system resilience throughout 
the design process. Extensive simulation of a hybrid 
computational model of a water revitalization subsystem 
(WRS) with probabilistic, component-level faults provides 
data about off-nominal behavior of the system. The data 
can then be used to test alternative measures of 
resilience as predictors of the system’s ability to recover 
from component-level faults. A novel approach to 
measuring system resilience using a Markov chain model 
of performance data is also developed. Results 
emphasize that resilience depends on the complex 
interaction of faults, controls, and system dynamics, 
rather than on simple fault probabilities. 

INTRODUCTION 

Building complex systems such as life-support systems 
that are resilient in the face of unexpected events and 
component failures can be a difficult task. Even if the 
hardware of the system is fixed, creating robust control 
mechanisms is by no means easy. Yet arguably, 
resilience is the most crucial property of a life support 
system. This paper examines the problem of measuring a 
system’s resilience, and proposes some approaches for 
doing so. 

Measuring resilience is useful for a number of reasons. 
Most importantly, it allows a direct comparison of the 
robustness of two proposed systems. This allows us to 
judge which system is more robust, but we can also use it 
to make two systems equally robust so that commonly 
used mass-based measures of the quality of life-support 
systems can be applied to them. It also provides us with a 
measure to optimize against, for example by using it as an 
objective function for a learning controller for a system. 

What does it mean for a system to be resilient? There has 
recently been a great deal of interest in this topic in a 
wide range of fields, including software engineering, cell 
biology, sociology and ecology.¹ The Santa Fe Institute 
has gathered eighteen definitions of the term 
‘robustness’, ranging from narrow technical definitions to 
broad-ranging qualitative descriptions intended to 
identify the commonalities across a diverse range of 
systems [SFI, 2002]. The alternative definitions also help 
identify “the similarities and differences among terms 
often used interchangeably with ‘robustness’ including 
‘stability,’ ‘resilience,’ ‘reliability,’ ‘persistence,’ 
‘survivability,’ ‘fault-tolerance,’ ‘plasticity,’ etc.” 

One of the general definitions best suited to resilience in 
ALS comes from an ecological perspective: “Robustness 
is the persistence of specified system features in the 
face of a specified assembly of insults” [Allen, 2002]. In 
the case of life support systems, the persistent system 
features include (hopefully) a continuous supply of clean 
water, food and breathable air being available to the 
crew. The assembly of insults consists of unexpected 
system events (for example, unexpectedly large 
variations in resource usage by the crew) and 
component faults, such as tanks leaking, sensors failing, 
and so on. 

¹ Two interdisciplinary groups studying the issues of resilience and 
robustness maintain websites with overviews, definitions, discussion 
groups and extensive references. The Santa Fe Institute’s Robustness 
Program seeks to “explore the phenomenology, origins, mechanisms, and 
consequences of robustness in natural, engineering, and social systems” 
and can be found at http://discuss.santafe.edu/robustness/ 
The Resilience Alliance can be found online at 
http://www.sustainablefutures.net/resilience/resilienceDef.html 

Holling distinguishes between ‘engineering resilience,’ 
which focuses on “stability near an equilibrium steady- 
state, where resistance to disturbance and speed of 
return to the equilibrium are used to measure the 
property” and ‘ecological resilience,’ which measures 
“the magnitude of disturbance that can be absorbed 
before the system changes its structure by changing the 
variables and processes that control behavior.” 
Ecological resilience focuses on off-nominal behavior far 
from equilibrium and the possibility that the entire system 
can enter another (undesirable) regime [Holling, 2002]. 
Both aspects apply to ALS, which has the dual goals of 
shortening the time to return to nominal behavior and 
avoiding a state of critical failure. 

The closed and semi-closed natures of regenerative 
systems pose unique problems in measuring robustness 
or resilience. Typical approaches such as equating 
robustness with the quantity of unused resources that 
could be used to recover from a fault are unsuitable in 
closed systems because increasing the resources held 
in a particular buffer necessarily reduces the amount held 
elsewhere. This makes it hard to identify the states or 
operating modes that have the greatest capacity to 
recover from failures. Furthermore, resilience is a 
property not just of the hardware that makes up a system, 
but also of the way that hardware is controlled; two 
different control approaches can produce vast 
differences in the resilience of the same physical system. 
For these and other reasons outlined below, the analysis 
of the robustness of a system is best done dynamically, 
by measuring the behavior of the system over time as 
faults occur. 

Resilience is a dynamic, rather than a static, property of 
the system. To see that this is the case, consider the 
definition of resilience as “the persistence of specified 
system features in the face of a specified assembly of 
insults.” That is not to say that in a resilient system faults 
do not occur, but that when (some specified set of) faults 
or unexpected events occur, a resilient system can 
recover from them and return to normal operation. Thus 
the property of interest is the behavior of the system 
over time. Resilience involves determining whether 
normal system operation resumes after a fault, and 
whether this occurs within an acceptable amount of time 
(e.g. before the crew become ill). Dynamic analysis of the 
system is the only way these issues can be determined, 
and static measures such as equivalent system mass 
(ESM) cannot hope to fully capture these details. For 
some simple systems, exact dynamic analyses can be 
made directly by examining the system model. Because 
of the complexity and non-linear dynamics typical of 
advanced life support systems (ALSS), such analysis is 
most easily performed through simulation. 


The next section examines the current ALS metric used 
to evaluate life support systems and details why it is not 
an adequate measure of resilience. The following 
section describes the specifics of our approach, and the 
one after it describes the simulation test bed model of a 
generic WRS. Finally, preliminary results and analysis of 
simulation data demonstrate how our proposed 
measures of resilience work in practice. 

THE ALS LIFE SUPPORT METRIC 
AND RESILIENCE 

We now briefly review the current advanced life support 
(ALS) metric, ESM. The ESM of a technology or 
subsystem is traditionally computed from static analyses, 
assuming nominal operation. ESM computations 
consider the mass, volume, power, cooling and crewtime 
requirements of the system under nominal behavior. The 
life support needs for volume, power, cooling, and 
crewtime are converted to units of mass to represent the 
required launch mass of the entire life support system 
(LSS). Mass units are used because the launch mass of a 
system is commonly correlated to mission cost. The 
reader is referred to Levri et al (2000) for a more detailed 
explanation of the static method of ESM computation. 

While ESM is a useful metric for evaluating the launch 
cost for a system that always performs nominally, it does 
not measure resilience to faults. One of the main points 
in this paper is that resilience is a dynamic property of a 
system as a whole, and as such, is not adequately 
measured by metrics such as ESM. Valid ESM 
comparisons require that systems “satisfy the same life 
support product quantity, product quality, reliability and 
safety requirements. In situations where product quality 
is somewhat subjective or the level of safety or reliability 
is not well defined, the researcher's expertise on those 
issues must be used to estimate appropriate 
requirements and relevant adjustments in ESM” [Levri, 
et al, 2000]. Because ESM requires that two candidate 
systems must be equally resilient for a valid comparison, 
the development of explicit measures of resilience will 
complement and enhance this metric. 

ESM takes account of system robustness only in the 
time the crew spends on nominal maintenance, and the 
mass of spare parts needed for that nominal 
maintenance. The ability of a system to respond to off- 
nominal events is not captured in ESM. ESM is unable to 
reflect the improvement in a system that is made by 
simply changing the controls approach, without 
changing the mass, volume, power, cooling or crewtime 
needs. In other words, if system A and system B are 
equivalent in mass, volume, power, cooling and crewtime 
needs, but system A is, in general, more resilient to 
off-nominal events, this advantage is not reflected in the 
ESM measure. Attempts should be made to apply the 
same rigor used in estimating ESM for two competing 
technologies to analyzing their robustness, resilience, 
and ability to meet critical performance requirements 
under off-nominal operating conditions. 

A reliability measure that is often used is the mean time 
before failure (MTBF). MTBF is typically calculated for 
each component of a system, and then summed for all 
system-critical components, with redundant components 
reducing the total correspondingly. This exercise arrives 
at the MTBF for the whole system [Jones, 1999]. 
Unfortunately, for a complex system this estimate may be 
wildly inaccurate since it assumes a very simplistic 
relationship between component failures and overall 
system failure. In particular, it ignores interactions 
between non-critical system failures. In the simple WRS 
modeled in this paper, a series of bed breakthroughs, 
which are part of nominal system operation, can lead to 
increases in contaminant concentration in greywater 
tanks, which in turn cause more breakthroughs and 
potentially critical system failures. The MTBF of the beds 
does not, in isolation from additional data about the state 
of the system and the control system in place, provide a 
useful estimate of the probability of system failure. 

Estimating the MTBF for the system as a whole in the 
presence of multiple, randomly injected, component- 
level faults is, in fact, one aspect of measuring a system’s 
resilience. Ultimately, the complex interaction of faults, 
controls, and system dynamics, rather than simple fault 
probabilities, determines resilience. 

METHODS 

COMPUTATIONAL MODELING 

Our approach to measuring resilience utilizes data from 
repeated simulation of a computational model of the 
system. Because resilience is an inherently dynamic 
property of a system, static measures that don’t consider 
off-nominal behavior or account for control system 
responses to faults do not adequately capture resilience. 
Running extensive simulations of a model allows us to 
observe the dynamic effects of and the interactions 
between component faults, control system decisions 
and random variations in system inputs. This data is then 
analyzed to test the predictive power of different 
measures of system resilience. 

To measure the resilience of a system by this method we 
develop the following: 

1. A system model that exhibits both 
nominal and off-nominal system behavior and 
includes the uncertainty inherent in the system, 
for example in the amount of resources used by 
the crew of a life-support system or the 
contaminant capacity of a filtration bed; 

2. A control system that handles both nominal 
and off-nominal operating conditions; 


3. A fault model that describes the likelihood of 
each possible system fault occurring (possibly as 
a function of the system state); 

4. An evaluation method for comparing the 
simulated performance of the system to the 
corresponding life support goals. 

A single simulation of the model starting from a given 
initial condition is called, prosaically enough, a run, and 
the sequence of states that the system passes through 
during the entire run is its trajectory. While technically the 
state of a system is defined to be the minimum set of 
information required to derive the subsequent system 
behavior, here the terms ‘state’ and ‘state data’ refer to all 
of the descriptive data gathered during a run, while ‘system 
performance’ and ‘performance data’ refer to the much 
smaller set of simulation data that directly reflect the 
ability of the system to achieve critical life support goals. 
Examples of state and performance data for this work are 
shown in Tables 1 and 2, respectively. 
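
To make the vocabulary of run, trajectory, state data and performance 
data concrete, the following is a minimal sketch in Python. It is not 
the WRS model used in this paper: the single-tank dynamics, the 
function name simulate_run and every parameter value are assumptions 
chosen only for illustration. 

```python
import random


def simulate_run(initial_level, demands, capacity=100.0, inflow=5.0):
    """Return (trajectory, performance) for one run of a toy single-tank model.

    trajectory:  tank level after every time step (state data).
    performance: 1 if demand was fully met in that step, else 0.
    """
    level = initial_level
    trajectory, performance = [], []
    for demand in demands:
        level = min(capacity, level + inflow)  # refill, capped at capacity
        supplied = min(level, demand)          # cannot supply more than is held
        level -= supplied
        trajectory.append(level)
        performance.append(1 if supplied >= demand else 0)
    return trajectory, performance


# Hypothetical inputs: the random demands are drawn before the run starts,
# so the run itself is deterministic given those inputs.
rng = random.Random(0)
demands = [rng.uniform(0.0, 10.0) for _ in range(50)]
trajectory, performance = simulate_run(20.0, demands)
```

Here the trajectory plays the role of state data and the 0/1 sequence 
plays the role of performance data; a realistic model would record 
many more state variables per time step. 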

A model for examining system resilience needs to be 
detailed enough to have non-trivial dynamics and faults 
that occur at the component level, yet simple enough 
that its nominal behavior is well understood and that 
intuitive notions of system resilience can be confirmed 
with data. For example, it seems reasonable to assume 
that a water recovery system where the filtration beds 
have a higher capacity to remove contaminant is, other 
things being equal, more resilient than one with a lower 
bed capacity. The system performance statistics from 
simulations confirm this rule-of-thumb notion of 
resilience, allowing us to use simulation data for systems 
that differ only in bed capacity to propose and test 
measures of system resilience. The measures of 
resilience are then compared to data for the same 
systems with induced probabilistic faults, to see if the 
measures are useful predictors of system performance 
in the presence of unanticipated faults. 

A major strength of this approach is that it provides data 
about off-nominal behavior and examples of situations in 
which the system failed. Analysis of these can give 
valuable insights into the weaknesses of the system. 
Dynamic simulation-based data can be used to find the 
total probability of failure given the fault model or to 
determine the most probable trajectory of the system 
that leads it to fail. This analysis helps system designers 
identify the most vulnerable parts of the systems, not in 
terms of the probability of the underlying fault but in 
terms of its dynamic effect on system performance. In 
addition, the data can support qualitative statements 
about the performance of the system, such as “the 
system is resilient to all single faults that were simulated.” 
Finally, this approach allows direct comparison of the 
effects of two different control strategies applied to the 
same hardware or the effects of a particular hardware 
change on the performance of the system. 


PROPOSED MEASURES OF RESILIENCE 

We begin by describing two relatively simple approaches 
to measuring resilience: summary statistics, which describe 
performance over a set of runs with a single number, and 
correlated system characteristics, where resilience is 
measured indirectly by examining other features of the 
system that are intuitively connected with resilience. 

Simple summary statistics such as the proportion of the 
runs in which the system continues to perform normally 
provide a starting point for examining resilience and a 
point of comparison for more sophisticated measures. 
The resilience of two different systems to a given set of 
faults can be simply compared using statistics that 
summarize how often each system fails on a set of 
random runs including those faults. 
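
As a minimal sketch of such a summary statistic, the Python fragment 
below computes the proportion of successful runs for two systems. The 
function name success_proportion and the run outcomes are hypothetical 
and are not results from this paper. 

```python
def success_proportion(run_outcomes):
    """Fraction of runs in which the system performed normally throughout."""
    if not run_outcomes:
        raise ValueError("need at least one run")
    return sum(1 for ok in run_outcomes if ok) / len(run_outcomes)


# Hypothetical outcomes for two candidate systems over the same ten
# randomly generated fault scenarios (True = no performance failure).
system_a = [True, True, False, True, True, True, False, True, True, True]
system_b = [True, False, False, True, True, False, False, True, True, False]

print(success_proportion(system_a), success_proportion(system_b))
```

Running both systems against the same set of random scenarios, as 
assumed here, keeps the comparison from being dominated by differences 
in the faults each system happened to encounter. 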

Another attractive approach is to hypothesize that there 
are certain characteristics of the system, perhaps 
identified by a domain expert, that have some intuitive 
correlation with overall system robustness, but are easier 
to measure. These characteristics might provide useful 
measures of resilience. In the WRS model presented 
below, we might expect that the amount of reserve water 
in the tanks, or the number of times contaminated water 
reaches the potable water tanks, averaged over a number 
of runs, would be correlated with resilience. A system 
with lower amounts of reserve clean water is intuitively 
less resilient than one with greater reserves, and we 
might expect this relation to show up in the summary 
statistics as a higher number of successful runs when 
there is more reserve clean water. 

Summary statistics and correlated system characteristics 
can be used to compare the performance of competing 
systems, but may not provide a lot of information about 
design changes that could be made to improve a 
system’s resilience. In addition, correlated system 
characteristics are only useful for comparing similar 
systems and control strategies. For example, a control 
strategy specifically tuned to work with lower overall water 
availability may be more resilient than other strategies 
even though it typically has lower water reserves. 

A more sophisticated approach, which we are currently 
investigating, would assign a probability to each run that 
weights the data by the likelihood of the underlying faults 
and system inputs that produced it. Runs with very 
unlikely faults, or multiple faults, would have low 
probabilities associated with them, while runs with no 
faults, or single common faults, would have higher 
probabilities. Now, rather than using the proportion of 
successful runs to evaluate resilience, we can talk about 
the total probability mass of the successful runs, which 
should provide a more accurate estimate of which of two 
systems is better. It may also help identify which design 
modifications are most likely to improve system 
resilience, since the highest probability runs with 
unsatisfactory performance are the most likely events 
that the system can’t recover from. 
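
One way this weighting might be sketched in Python is shown below. How 
the weight of each run would actually be derived from the fault model 
is left open in the text, so the weights here are simply given as 
hypothetical inputs, and both function names are illustrative 
assumptions. 

```python
def success_probability_mass(runs):
    """Share of total weight carried by successful runs.

    runs: list of (weight, succeeded) pairs; weights need not be normalized.
    """
    total = sum(weight for weight, _ in runs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight for weight, ok in runs if ok) / total


def likeliest_failure(runs):
    """Index of the highest-weight unsuccessful run, or None if none failed."""
    failed = [(weight, i) for i, (weight, ok) in enumerate(runs) if not ok]
    return max(failed)[1] if failed else None


# Hypothetical weighted runs: (probability of the run's faults and inputs,
# whether performance was satisfactory).
runs = [(0.90, True), (0.06, False), (0.03, True), (0.01, False)]
mass = success_probability_mass(runs)
worst = likeliest_failure(runs)
```

The second function picks out the unsatisfactory run with the highest 
weight, which corresponds to the most likely event that the system 
cannot recover from. 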

MEASURING RESILIENCE WITH MARKOV CHAINS 

In this paper we propose a novel approach to measuring 
resilience based on the parameters of a Markov chain 
model of system performance. A Markov chain is a 
concise way of describing the probabilities that the 
system will move from state to state.² Figure 1 shows a 
two-state Markov chain model of a system where state 1 
represents normal system behavior and state 0 
represents abnormal behavior. In the simulation model 
developed here, state 1 represents ‘astronaut demand 
for clean water is being met’ and state 0 represents 
‘astronaut demand for clean water is not being met.’ In 
other life support applications, the states might 
represent acceptable versus higher than acceptable 
levels of CO2 in the atmosphere. 

Figure 1: A two-state Markov chain of system performance 

Suppose that the system is currently in state 1. There are 
two possible outcomes in the next time step: the system 
can stay in state 1 or transition to state 0. The parameter 
α is the probability that the system moves to state 0, 
given that it is currently in state 1. Conversely, (1 - α) is 
the probability that the system stays in state 1, given that 
it is currently in state 1. Similarly, β is the probability that 
the system moves to state 1, given that it is currently in 
state 0, and (1 - β) is the probability that the system stays 
in state 0, given that it is currently in state 0. 

Estimating the parameters α and β from the data 
collected from individual runs or aggregated over 
multiple runs is straightforward: simply count the number 
of times that each event occurred [(1 -> 1), (1 -> 0); 
(0 -> 0), (0 -> 1)] and divide by the number of times the 
system was in the relevant state. In other words, from an 
empirical point of view α is simply the number of times 
that the system changed from state 1 to state 0 divided 
by the total number of times the system was in state 1. 
Note that each run must be broken up into discrete 
time periods to gather this data, and that the values of 
α and β are not independent of the choice of period. 

² The key feature of Markov chain models is that the future state depends 
only on the current state, not on the entire history of the system up to that 
point. However, the validity of this assumption is not critical to its 
usefulness as a measure of robustness; what matters is that the Markov 
chain model provides a reasonable facsimile of the system behavior being 
analyzed. 
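
The counting procedure described above can be sketched in a few lines 
of Python. In this sketch α denotes the probability of moving from 
state 1 to state 0 and β the probability of moving from state 0 to 
state 1. The function name estimate_transition_probs and the example 
sequence are illustrative assumptions, and the sketch is not the code 
used to produce the results in this paper. 

```python
def estimate_transition_probs(states):
    """Estimate (alpha, beta) from a sequence of 0/1 performance states.

    alpha: fraction of time steps spent in state 1 that were followed by 0.
    beta:  fraction of time steps spent in state 0 that were followed by 1.
    Only time steps that have a successor are counted. A parameter is
    returned as None if its starting state never occurs.
    """
    counts = {(1, 1): 0, (1, 0): 0, (0, 0): 0, (0, 1): 0}
    for current, following in zip(states, states[1:]):
        counts[(current, following)] += 1
    from_1 = counts[(1, 1)] + counts[(1, 0)]
    from_0 = counts[(0, 0)] + counts[(0, 1)]
    alpha = counts[(1, 0)] / from_1 if from_1 else None
    beta = counts[(0, 1)] / from_0 if from_0 else None
    return alpha, beta


# Hypothetical performance data, one entry per discrete time period.
states = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1]
alpha, beta = estimate_transition_probs(states)
```

Sampling the same run with a different period length changes the 
sequence passed to the function, which is why the estimates depend on 
the choice of period. 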

A high β (the probability of returning from state 0 to 
state 1) indicates that even if the system enters the 
abnormal state, it is highly likely to return to the normal 
state, so the system is more resilient than one 
where β is low. Similarly, a low α (the probability of 
moving from state 1 to state 0) is desirable because 
it indicates that the system is unlikely to enter the 
abnormal state in the first place. Resilient systems, that 
is, systems with the ability to successfully recover from 
faults, are likely to have high values of β and low values of 
α. Consequently, our proposed measure of resilience is 
simply β/α, which eliminates the units of time from the 
measure. This is a summary statistic or aggregate 
measure of system resilience over a given run or a 
specified length of time; consequently, the individual 
runs must be long enough to provide adequate data for 
estimating the parameters if the measure is to provide 
meaningful information. 
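
One way to interpret this ratio, offered here as a standard property of 
two-state Markov chains rather than as a claim made in the original 
analysis, is through the long-run (stationary) probabilities of the two 
states. Writing α for the probability of moving from state 1 to state 0 
and β for the probability of moving from state 0 to state 1, and 
assuming both are strictly between 0 and 1, the stationary 
probabilities π1 and π0 satisfy: 

```latex
\pi_1 \alpha = \pi_0 \beta, \qquad \pi_0 + \pi_1 = 1
\;\Longrightarrow\;
\pi_1 = \frac{\beta}{\alpha + \beta}, \quad
\pi_0 = \frac{\alpha}{\alpha + \beta}, \quad
\frac{\beta}{\alpha} = \frac{\pi_1}{\pi_0}.
```

Under these assumptions the ratio is the long-run odds of finding the 
system in the normal state rather than the abnormal state. 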

A HYBRID COMPUTATIONAL MODEL FOR 
MEASURING RESILIENCE 

A model of a simplified WRS was developed to measure 
and analyze the resilience of regenerative life support 
technologies. The goal in designing the test system was 
not verisimilitude to a particular WRS, but rather, the 
creation of a relatively generic example of a regenerative 
life support subsystem constrained by conservation of 
mass. This simplified model provides a computational 
test-bed for developing methods to identify and 
measure resilience early in the design stage. 

A GENERIC WATER REVITALIZATION SUBSYSTEM 

The system consists of four main elements as shown in 
Figure 2: 

1. Astronauts, who demand clean water for 
hygiene and drinking and introduce contaminant into 
the system. Detailed information on water use 
appears in Table 7 in the appendix. 

2. Waste Water Tank (WWT), which receives 
greywater from the astronaut chamber and which 
controls the variable flow of water to the filtration beds. 
The WWT also receives greywater directly from the 
potable water tanks when sensors indicate 
unacceptable water quality. 

3. Filtration Beds, which remove contaminant from 
the greywater. Two beds operate in parallel with 
water flowing over one bed at a time. Beds remove 
100% of inflow contamination until their capacity is 
reached. When the contaminant in the beds 
exceeds capacity they ‘break through’ and remove 
no contaminant (a step function). After breakthrough 
occurs, beds are regenerated to a new, randomly 
chosen capacity. The regeneration process takes 
one hour. 

4. Potable Water Tanks (PWT), which receive clean 
water from the filtration beds and hold it during 
testing. There are four parallel tanks. Only one tank 
can be filling at a time, but more than one tank can 
drain at a time, either to the astronauts or directly to 
the WWT. Note that the outflow valve from the PWT 
controls the flow of clean water into the astronaut 
chamber, though its setting is determined by 
astronaut demand. Tank sizes and other parameters 
are detailed in Table 8 in the appendix. 
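
The step-function behavior of a filtration bed described in the list 
above can be sketched in Python as follows. This is an illustration 
only and not the Simulink implementation: the class name, the uniform 
capacity distribution and its bounds are assumptions, the one-hour 
regeneration delay is not modeled, and the sketch assumes that the 
time step in which capacity is first exceeded is still fully filtered. 

```python
import random


class FiltrationBed:
    """Step-function bed: removes all contaminant until capacity is exceeded."""

    def __init__(self, rng, min_capacity=50.0, max_capacity=150.0):
        self.rng = rng
        self.min_capacity = min_capacity   # assumed lower bound on capacity
        self.max_capacity = max_capacity   # assumed upper bound on capacity
        self.regenerate()

    def regenerate(self):
        """Reset the bed with a new, randomly chosen capacity."""
        self.capacity = self.rng.uniform(self.min_capacity, self.max_capacity)
        self.load = 0.0
        self.broken_through = False

    def process(self, contaminant_in):
        """Return the contaminant that passes through the bed in this step."""
        if self.broken_through:
            return contaminant_in        # after breakthrough nothing is removed
        self.load += contaminant_in
        if self.load > self.capacity:
            self.broken_through = True   # capacity exceeded: bed breaks through
        return 0.0


bed = FiltrationBed(random.Random(1))
outflow = bed.process(10.0)  # a fresh bed removes everything, so this is 0.0
```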

Figure 2: Schematic of Generic WRS Model 



There are two parallel mass flows: water and a generic 
contaminant. Energy flow in the system (which would be 
added by pumping and lost by location change and 
friction in a physical system) is not explicitly modeled. In 
other words, mass flow is modeled but energy flow is not. 
Flows from element to element are controlled by switch 
valves (squares with x’s), which select the outlet or 
destination of the flow, and variable flow valves (circles 
with crosses), which control the rate of flow out of a tank. 

A separate control and sensor system collects data and 
controls the valve settings, the choice of filtration beds, 
and the potable water tank assignments. The sensors 
are attached to each of the PWTs and determine 
whether the water quality is acceptable or not. The key 
control decisions are: 1) the setting of the outflow valve 
on the WWT; 2) the choice of bed in use; 3) the potable 
water tank being filled and being drained. In the baseline 
control system, the variable setting for the outflow valve 
on the WWT depends primarily on the WWT level. 
The change of filtration bed in use is based on a fixed 
schedule, and the PWTs fill and drain completely before 
switching occurs. Additional system specification data 
appears in Tables 7, 8 & 9 in the appendix. 

The quantity of water demanded, the capacity of the 
beds after each regeneration cycle, and the time that 
faults occur are generated randomly at the start of each 
simulation, but the simulation itself is deterministic. The 
flow of water and contaminant through the system is 
continuous, but many control decisions and fault 
occurrences are discrete, hence the WRS is a hybrid 
system. The simulation is implemented in Simulink, a 
hybrid system modeling language that works in 
conjunction with Matlab. 

FAULTS IN THE TESTBED WRS 

The simulation model is a tool for measuring resilience to 
component faults. Faults, along with possible control 
system responses and associated performance failures, 
are detailed in Table 9 in the appendix. For clarity of 
exposition, “faults” refer to the unanticipated failure of 
individual components while “failure” refers to 
performance failures of the WRS such as not providing 
clean water to the astronauts or providing untested 
water. While many potential faults can, and in a real 
system should, be compensated for with the appropriate 
control system response, in this paper we allow faults to 
occur in order to observe the off-nominal system 
behavior and resulting performance failures. (In other 
words, none of the control system responses detailed in 
Table 9 are actually implemented in the simulations 
presented here; they are merely listed as examples.) 

Multiple faults can also interact, reducing the resilience of 
the system even more dramatically than a single, critical 
fault. A sensor fault makes it more likely that dirty water is 
sent to the astronauts, and can be compensated for by 
decreasing the amount of time before bed regeneration. 
In fact, anything that increases the number of PWTs that 
get filled with contaminated water also increases the 
probability that contaminated water is sent to the 
astronauts, as does any valve failure that prevents the 
control system from correctly routing water through the 
system. In addition, these faults occurring 
simultaneously can interact to dramatically increase the 
probability that the astronauts receive contaminated 
water. The discussion below, for example, explains how 
an interruption of water service can make future bed 
breakthroughs more likely. 

Note that we do not consider bed breakthroughs as 
‘injected faults’, as they occasionally occur even in 
systems with high absorptive capacity which do not 
experience any performance failures. The number of bed 
breakthroughs that occur depends on the absorptive 
capacity of the beds and the length of time before bed 
regeneration. A decline in the average bed capacity or 
the failure of a bed to regenerate are, however, 
considered faults. Tank overflows are considered 
performance failures, not faults.³ 

System design is often an incremental process in which 
the robustness and other characteristics of the system are 
gradually improved over a series of iterations. The 
robustness measure used can also be expected to 
change over these iterations. For example, during initial 
design, a system might be optimized only against ESM or 
some other static measure. Once a satisfactory design 
has been reached, dynamic analysis of the system 
performance in nominal operating modes might be used 
to further refine the system. Finally, a robustness 
measure such as the ones we are proposing can be 
used to dynamically analyze system performance in 
off-nominal scenarios, possibly even with different resilience 
metrics being used as the evolution of the system 
continues. These steps should be iterated as necessary. 

In a similar, incremental manner, in the phase of 
robustness measurement, the control system should 
also begin with no failure prevention strategies included. 
Measuring the resilience of this stripped-down system 
gives information about the most important faults. 
System changes designed to prevent the resulting 
failures can then be added, the new system’s resilience 
tested, and so on. This process ensures that the control 
system is not needlessly complex, since it includes 
control responses only to faults that actually occur in 
practice and to which the system is not already robust. 

SIMULATION DATA & ANALYSIS 

First we demonstrate the nominal behavior of the 
system, that is, the system behavior without injected 
faults, through example simulations of three possible 
system performance scenarios. Next we present 
summary statistics for system performance under 
different specifications of bed capacity. We use this data 
to propose different measures of system resilience. 
Finally, data for the same systems with probabilistic faults 
injected is used to verify that the measures of resilience 
are in fact useful predictors of system performance in the 
presence of unanticipated faults. 

NOMINAL SYSTEM BEHAVIOR: EXAMPLES 

• Case 1: some bed breakthroughs, no 
interruptions of water service; 

• Case 2: some bed breakthroughs, some non- 
life threatening interruptions of water service; 

• Case 3: bed breakthroughs leading to complete 
system breakdown and loss of crew. 


³ Note that all tanks have overflow valves designed to prevent them from 
holding more water than their capacity; excess water flows onto the ‘floor’ 
and is not returned to the system in this simple example. Of course, if the 
overflow valve experienced a failure 


In all three cases the filtration beds have a ‘medium 
capacity’ (see Appendix for specification details). 

Case 1: No Performance Failures 

We start with Case 1, where some filtration beds break 
through but clean water is supplied continuously to the 
astronauts. In Figure 3, the blue line shows the filtration 
bed breakthroughs. The signal 1 indicates a 
breakthrough and 0 indicates no breakthrough; 
consequently, when a breakthrough occurs it appears as 
a vertical blue line. There are 8 breakthroughs in this run. 
The red line indicates water availability, with 1 indicating 
that the demand for clean water is met and 0 indicating 
that no clean water is available. Figure 4 shows the clean 
water reserves in the system and Figure 5 shows the 
dynamics of the four potable water tank levels. 

Figure 3: Case 1, Bed Breakthroughs and Water Availability 



Figure 4: Case 1, Clean Water Reserves 




Relatively low demand in the period (175-225), and the 
corresponding slow drainage of clean water from the 
tanks in that period (indicated by the sparse section of 
Figure 5), leads to an increase in clean water reserves. 
High demand causes low water reserves in the period 
(250-350), and the subsequent breakthroughs that occur 
in the (325-375) range almost lead to an interruption in 
water service. 


Case 2: Non-Critical Interruptions of Water Service 

In Case 2 the system also has some filtration bed 
breakthroughs, but in addition experiences some interruptions 
of water service. Figures 6, 7, & 8 present the data for 
this case. The first vertical red line indicates a short 
service interruption of 3 minutes; the next two vertical 
red lines demarcate a more serious service interruption 
lasting 6 hours and 37 minutes. Note that the water 
shortage is caused by the need to empty multiple dirty 
PWTs in the period (275-300), which in turn is caused by 
the earlier bed breakthroughs. The relatively dense colors 
in this period in Figure 8 result from this filling and rapid 
draining of tanks. There are 7 breakthroughs in this run. 
Note that fewer breakthroughs occurred in Case 2 than in 
Case 1, but that in Case 2 the breakthroughs resulted in 
service interruptions. 

Figure 6: Case 2, Bed Breakthroughs and Water Availability 



Figure 7: Case 2, Clean Water Reserves 



Figure 8: Case 2, Potable Water Tank Levels 



Case 3: Critical Performance Failure 

The final case demonstrates how this simple WRS with a 
control system that does not compensate for 
component failures can fail completely. Figures 9, 10, 11 
& 12 show the system breakdown around time 175. The 
single vertical red line shows the end of water service to 
the astronaut chamber. The almost solid swath of color in 
Figure 11 shows the frantic but ultimately unsuccessful 
filling and draining of the PWTs with contaminated water. 







How does this critical failure occur? The key point is that 
the astronauts continue to produce contaminant even 
when no water is available 4. (Otherwise, a system with 
service interruptions would have less contaminant to 
process overall.) This results in a higher concentration of 
contaminant in the WWT, shown in Figure 12. In the 
absence of a compensating control action, such as 
decreasing the amount of time between bed 
regenerations, this rise in the contaminant level makes 
additional bed breakthroughs more likely in the future. 
This failure propagation mechanism, a positive feedback 
loop, can under certain conditions lead to total system 
failure. 
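
The size of this effect can be illustrated with a back-of-the-envelope calculation. The sketch below is not the simulation model. It assumes the daily averages given in the Appendix (70.2 liters of water used and 4.3768 liters of contaminant produced per day), treats the contaminant as part of the flow sent to the WWT, and uses a hypothetical function name. It computes the contaminant fraction of the WWT inflow when a given fraction of the water demand goes unmet while contaminant production continues on schedule.

```python
# Daily averages taken from the Appendix (note 12).
DAILY_WATER_USE_L = 70.2      # liters/day; assumed here to be the flow reaching the WWT
DAILY_CONTAMINANT_L = 4.3768  # liters/day of generic contaminant

def wwt_inflow_contaminant_fraction(unmet_fraction: float) -> float:
    """Contaminant fraction of the WWT inflow (0..1).

    unmet_fraction is the share of water demand that is not supplied.
    Assumption: contaminant keeps flowing at its nominal rate, while the
    water that carries it is reduced in proportion to the unmet demand.
    """
    if not 0.0 <= unmet_fraction <= 1.0:
        raise ValueError("unmet_fraction must lie between 0 and 1")
    carrier_water = (DAILY_WATER_USE_L - DAILY_CONTAMINANT_L) * (1.0 - unmet_fraction)
    return DAILY_CONTAMINANT_L / (DAILY_CONTAMINANT_L + carrier_water)

if __name__ == "__main__":
    for unmet in (0.0, 0.25, 0.5):
        share = 100.0 * wwt_inflow_contaminant_fraction(unmet)
        print(f"unmet demand {unmet:4.0%}: inflow is {share:.2f}% contaminant")
```

Under these assumptions the inflow concentration rises monotonically with the fraction of unmet demand, which is the mechanism that makes further breakthroughs more likely.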

Figure 9: Case 3, Bed Breakthroughs and Water Availability 


Figure 10: Case 3, Clean Water Reserves 



Figure 11: Case 3, Percent Contaminant in WWT 



Figure 12: Case 3, Potable Water Tank Levels 
SUMMARY STATISTICS AND DATA COLLECTED 

The analysis and measurement of system resilience 
supports the overriding goal of ensuring crew safety and 
health. The contribution of the WRS is, of course, 
providing a reliable supply of contaminant-free water for 
astronaut hygiene and drinking. The possible 
performance failures that we consider, in rough order of 
severity, are: 

1. interruption of water service for more than 72 
hours; 

2. untested contaminated water sent to astronauts; 

3. interruption of water service for less than 72 
hours; 

4. untested clean water sent to astronauts; 

5. loss of system water through leaks or overflows. 

Performance failures 2, 4 & 5 can only occur with injected 
faults, for example, faults involving sensors, valves or 
controls. 

The key question we are trying to answer is: What data 
best measures or predicts the ability of the system to 
avoid performance failures in the presence of faults? This 
suggests two subsidiary questions: What data should we 
collect and analyze in order to develop a measure of 
system resilience? What data is likely to be available in a 
functioning WRS? 

Tables 1 and 2 summarize the data collected from the 
simulations. There are two basic types of data: time 
series data, which consists of instantaneous 
observations of the state of the system at each time 
point, and summary data, which consists of state and 
performance data aggregated over an entire simulation 
or a given length of time. (Of course, the summary data is 
calculated from the time series data.) 
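
Because the summary data is calculated from the time series data, the interruption statistics of Table 2 can be obtained in a single pass over the water-availability signal. The following is a minimal sketch, assuming the signal is sampled at a fixed time step and stored as a sequence of booleans. The function name, argument names and dictionary keys are illustrative and are not taken from the simulation.

```python
from typing import Dict, Sequence

def interruption_summary(available: Sequence[bool], dt_hours: float) -> Dict[str, float]:
    """Aggregate a sampled water-availability signal into summary data.

    available[i] is True when the demand for clean water is met at sample i.
    dt_hours is the (assumed constant) time between samples.
    """
    run_lengths = []  # length of each interruption, in samples
    current = 0       # length of the interruption in progress, if any
    for water_ok in available:
        if water_ok:
            if current > 0:
                run_lengths.append(current)
            current = 0
        else:
            current += 1
    if current > 0:   # the signal ended during an interruption
        run_lengths.append(current)

    samples_down = sum(run_lengths)
    total = len(available)
    return {
        "number_of_interruptions": len(run_lengths),
        "total_interruption_hours": samples_down * dt_hours,
        "max_interruption_hours": max(run_lengths, default=0) * dt_hours,
        "percent_time_available": 100.0 * (total - samples_down) / total if total else float("nan"),
    }

if __name__ == "__main__":
    signal = [True, True, False, False, True, False, True, True]
    print(interruption_summary(signal, dt_hours=0.5))
```

The number, total length and maximum length of the interruptions are the only inputs needed for the run-level comparisons that follow.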




4 For ease of programming, the contaminant simply continues to flow on 
schedule to the WWT without the corresponding clean water flow. A more 
realistic alternative would be for the contaminant produced during 
interruptions of water service to be stored in the astronaut chamber and 
then flushed out at a higher rate once service resumes. The effect on the 
system will be the same in both cases: the concentration of contaminant in 
the WWT will increase whenever the demand for water is not met. 





Table 1: State Data Collected 

time series data 5                       summary data 

% contaminant in WWT                     average % contaminant in WWT 
WWT level                                average WWT level 
bed breakthrough status                  total number of bed breakthroughs 
% bed capacity used                      average and/or total % bed capacity used 
—                                        total number of bed changes 
clean water reserves                     average clean water reserves 
contaminated water in PWT                total number of contaminated PWTs drained to WWT 
untested water in PWT                    ... 
astronaut demand for water               total astronaut demand for water 
supply of water to astronauts            total supply of water to astronauts 
—                                        total water processed 
contaminant produced by astronauts       total contaminant produced by astronauts 
...                                      total contaminant processed 

Table 2: System Performance Data Collected 

time series data                             summary data 

% demand for water being satisfied           total % demand for water satisfied 
—                                            number of water service interruptions 
—                                            total length of water service interruptions 
—                                            maximum length of water service interruptions 
untested clean water flowing to astronauts   total quantity of untested clean water sent to astronauts 
contaminated water flowing to astronauts     total quantity of contaminated water sent to astronauts 
—                                            number of times astronauts received contaminated water 
PWT, WWT tank overflowing or leaking         total water lost due to overflows or leaks 


The distinction between the two data types is useful for 
more than ease of presentation: it also corresponds to 
two distinct, complementary ways of thinking about 
resilience. Aggregate statistics collected over a long 
period of time can be used to measure the overall resilience 
of a particular system. An aggregate or summary measure 
of resilience answers the question: on average, how well 
can this system recover from faults? Summary measures 
are particularly useful when comparing alternative system 
specifications. Time series data, on the other hand, is 
necessary to develop an instantaneous or conditional 
measure of resilience that reflects not only the system’s 
overall design but also its current state. An 
instantaneous measure of resilience reflects the 
system’s ability to recover from faults, conditioned upon 
the current state of the system. Conditional measures of 
resilience are useful in identifying the best nominal 
operating point of a system and may provide an early 
warning of system vulnerability before faults have 
occurred. 

5 Additional data collected includes the tank level, percent contamination, 
filling status, draining status and test status of each of the PWTs; the valve 
settings; the quantity of water in the astronauts; and the total water in the 
system. 

The Markov chain measure of resilience proposed above 
and most of the statistics presented here are aggregate 
measures of resilience and system performance. Future 
research will examine instantaneous measures of 
resilience. 

NOMINAL SYSTEM BEHAVIOR: 

Table 3 presents performance data, aggregated across 
runs, for systems with three different average bed 
capacities but no injected faults. The performance 
categories correspond to Cases 1, 2, & 3 from the 
previous subsection. 


Table 3: Summary Data - Nominal Operation 

                                                            Bed Capacity 
summary data                                                high     med      low 

total number of runs of 1200 hours                          18       20       23 
percentage of runs with no service interruption             78 %     60 %     35 % 
percentage of runs with some service interruption           22 %     30 %     39 % 
percentage of runs with critical failures (note 6)          0 %      10 %     26 % 
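
The outcome categories of Table 3 can be assigned mechanically from the longest service interruption in each run. The sketch below assumes the 72 hour threshold from the list of performance failures as the boundary between a non-critical interruption and a critical failure. The function names and the sample values are hypothetical and are not simulation output.

```python
from collections import Counter
from typing import Dict, Sequence

CRITICAL_HOURS = 72.0  # interruptions longer than this count as critical failures

LABELS = ("no service interruption", "some service interruption", "critical failure")

def classify_run(max_interruption_hours: float) -> str:
    """Assign one run to an outcome category from its longest interruption."""
    if max_interruption_hours > CRITICAL_HOURS:
        return "critical failure"
    if max_interruption_hours > 0.0:
        return "some service interruption"
    return "no service interruption"

def outcome_percentages(max_interruptions: Sequence[float]) -> Dict[str, float]:
    """Percentage of runs in each category, given one value per run."""
    if not max_interruptions:
        raise ValueError("at least one run is required")
    counts = Counter(classify_run(hours) for hours in max_interruptions)
    total = len(max_interruptions)
    return {label: 100.0 * counts[label] / total for label in LABELS}

if __name__ == "__main__":
    # Made-up longest interruptions (hours) for five runs.
    print(outcome_percentages([0.0, 0.0, 6.6, 100.0, 0.05]))
```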


The summary data from individual runs is used in Figure 
13, which plots the number of bed breakthroughs versus 
the number of service interruptions for each of the 53 
runs without critical failures. Each red asterisk represents 
one 1200 hour run with low bed capacity; blue and green 
asterisks are runs with medium and high capacity, 
respectively. The graph shows that although bed 
breakthroughs are the only source of performance failures 
in these simulations, they are not, on their own, good 
aggregate predictors of system performance. This 
emphasizes the importance of dynamic analysis in 
understanding resilience. 

6 We are aware that most self-respecting life support engineers would 
likely find the number of critical failures reported here unacceptable. More 
sophisticated control strategies, for example, switching beds immediately 
after the sensors indicate that a PWT has been contaminated, would 
perhaps lead to better system performance and recovery from certain 
kinds of faults. However, the goal is not to optimize the current system but 
rather to develop methods for measuring and analyzing the resilience of 
the system in the presence of probabilistic faults, which requires 
examination of off-nominal behavior. 

Figure 13: Number of Bed Breakthroughs versus Number of Service 
Interruptions (red = low bed capacity, blue = medium, green = high) 
[Axes: service interruptions; bed breakthroughs] 


Recall the parameters of the Markov chain described 
above: α represents the probability that the system goes 
from the desired state of providing clean water to the 
failure state of not providing water, and β represents the 
probability of moving out of the failure state to the 
desired state. Despite the rich array of data 
collected, the parameters (α, β) and the related measure 
of resilience, β/α, described above can be derived from a 
few simple statistics: the number of times the system 
failed to provide water 7, the total length of time the 
system failed to provide water, and the total length of 
time under consideration 8. Note that the measure is not 
defined for runs or lengths of time in which no 
performance failures occurred, and is not a useful 
description of a system with critical failures. 
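
As an illustration, the two Markov chain parameters (the failure probability and the recovery probability) can be computed from these statistics as follows. This is a minimal sketch, assuming the common maximum-likelihood estimates for a two-state chain: failures per hour spent in the desired state, and recoveries per hour spent in the failure state. The function and argument names are hypothetical, and the estimator defined earlier in the paper may differ in detail.

```python
from typing import Tuple

def estimate_markov_parameters(
    n_failures: int,
    n_recoveries: int,
    hours_down: float,
    hours_total: float,
) -> Tuple[float, float, float]:
    """Return (p_fail, p_recover, resilience) for a two-state chain.

    p_fail    : estimated probability per hour of leaving the desired state
    p_recover : estimated probability per hour of leaving the failure state
    resilience: the ratio p_recover / p_fail
    """
    hours_up = hours_total - hours_down
    if n_failures == 0 or hours_down <= 0.0:
        # Matches the remark in the text: undefined without performance failures.
        raise ValueError("measure is undefined when no performance failures occurred")
    if hours_up <= 0.0:
        raise ValueError("system never spent time in the desired state")
    p_fail = n_failures / hours_up
    p_recover = n_recoveries / hours_down
    return p_fail, p_recover, p_recover / p_fail

if __name__ == "__main__":
    # Two interruptions of 3 minutes and 6 hours 37 minutes, as in Case 2.
    # The 1200 hour run length is assumed here for illustration only.
    down = 3 / 60 + 6 + 37 / 60
    print(estimate_markov_parameters(2, 2, down, 1200.0))
```

When the number of recoveries equals the number of failures, the ratio reduces to hours in the desired state divided by hours in the failure state, so under this estimator the measure depends only on the share of time without water.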

Figure 14, found in the Appendix, shows the parameter 
estimates for individual runs with non-critical failures, as 
well as aggregate estimates for all runs and for all runs 
without critical failures. Note that the inclusion of critical 
failures dramatically changes the parameter estimates for 
the medium and low bed capacity. Measures derived 
from data without critical failures, taken by itself, will tend 
to overstate the resilience of systems with critical failures. 
However, including the data for performance failure long 
after the system has ceased functioning also skews the 
results 9. 

7 Technically, it also requires the number of times that the system 
successfully provided water; this is the same as the number of times 
that the system failed to provide water if water is being provided at the end 
of the run. 

8 For these calculations time is measured in hours. As mentioned 
previously, the estimates of the Markov chain parameters α and β are not 
independent of units of time. The length of time is either a single run or a 
group of runs aggregated together. It could also be a subset of a particular 
run. 

From the point of view of evaluating system 
performance, are many short service interruptions 
more or less serious than fewer, longer 
interruptions? If the first scenario is preferred, then a high 
β (quick recovery from the failure state) should be given 
more weight than a low α (infrequent failures), and vice 
versa if the second scenario is preferred. 

Our proposed measure is simply the ratio of the two 
parameters: β/α. What does this measure say about the 
resilience of the three systems with different bed 
capacity? Table 4 presents the estimates of the Markov 
chain parameters for aggregate data for all runs taken 
together, without and with the critical failures included. 

Table 4: Markov Parameters - Nominal Operation 

                                   Bed Capacity 
Measure                            high      med       low       med*      low* 

α (failure probability)            0.0003    0.0004    0.0010    0.0005    0.0015 
β (recovery probability)           0.2682    0.3436    0.1977    0.0084    0.0053 
β/α (resilience)                   964.39    961.08    191.03    16.754    3.5756 

* includes data from runs with critical failures 

The proposed measure of resilience confirms the 
intuition that a system with higher bed capacity should be 
more resilient than one with a lower bed capacity. 
However, while the difference between the measured 
resilience of the low and medium capacity is large, 
the difference between that of the medium and high 
capacity is quite small (excluding the runs with critical 
failures). This reflects the similar performance of the two 
systems: for example, there is no water service for 
0.10358% of the time with high capacity versus 
0.10394% of the time with medium capacity. This 
would indicate to a system designer that the marginal 
benefit of improving the average bed capacity by the first 
0.05 increment is large, whereas an additional 
0.05 increase provides only a small increase in the 
reliability of the system 10. 
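
The two percentages quoted above can be recovered from the resilience ratios in Table 4. The check below is a sketch under one assumption: that the long-run fraction of time without water equals the stationary probability of the failure state in a two-state chain, which is 1 / (1 + ratio), where the ratio is the recovery probability divided by the failure probability. The function name is illustrative.

```python
def fraction_of_time_without_water(resilience_ratio: float) -> float:
    """Stationary probability of the failure state of a two-state chain.

    resilience_ratio is the recovery probability divided by the failure
    probability, i.e. the proposed measure of resilience.
    """
    if resilience_ratio < 0.0:
        raise ValueError("resilience_ratio must be non-negative")
    return 1.0 / (1.0 + resilience_ratio)

if __name__ == "__main__":
    # Ratios for runs without critical failures, taken from Table 4.
    for label, ratio in (("high", 964.39), ("med", 961.08), ("low", 191.03)):
        percent = 100.0 * fraction_of_time_without_water(ratio)
        print(f"{label:>4} bed capacity: no water service {percent:.4f}% of the time")
```

For the high and medium capacities the result agrees with the quoted figures to within rounding, which is why two nearly equal ratios correspond to nearly equal shares of time without water.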

SYSTEM BEHAVIOR WITH A VALVE FAULT: 

Next we consider the behavior of the system and the 
corresponding measures of resilience when a valve fault 
occurs. Again, there are three different bed capacities. 
The fault occurs in the switch outflow valve of PWT 4, 
which sticks in the position that routes water to the 
astronaut chamber. Consequently, if the tank is filled with 
greywater as the result of a bed breakthrough, there is no 
way to drain it to the WWT and it cannot be used. The 
PWTs are now three tanks in parallel. 

9 A reasonable alternative, not pursued in this version due to time 
constraints, would be to develop a Markov chain model with a third 
absorbing state representing system failure. An empirical compromise 
might be to eliminate the data from the time that the system is determined 
to have failed, after 72 hours, to the end of that run. 

10 Keeping in mind that a more sophisticated control would likely avoid the 
critical failures. 

Table 5 presents the summary data for this scenario. 
Unfortunately, the much smaller sample size makes 
direct comparison with the previous results more 
difficult 11. In particular, the data for the low bed capacity 
does not correspond to the same range of initial 
conditions for contaminant production and bed capacity 
after regeneration. 


Table 5: Summary Data - Valve Fault 

                               Bed Capacity 
summary data                   high     med      low 

total runs of 1200 hours       7        9        11 
no service interruption        43 %     22 %     35 % 
some service interruption      43 %     67 %     47 % 
critical failures              14 %     11 %     18 % 


Note that the high and medium capacity cases had 
one critical failure each; the higher percentage of critical 
failures with high bed capacity reflects the smaller 
sample size. The low capacity case had two critical 
failures. 

Table 6 presents the estimates of the Markov chain 
parameters and measure of resilience for data 
aggregated across all runs, both without and with the 
runs that experienced critical failures. In all cases the 
systems with the fault have lower measured resilience 
than the corresponding systems without faults. The 
proposed measures still show that systems with higher 
bed capacity are more resilient, but the differences 
between the three systems are now more evenly 
distributed. Again, this correctly reflects the relative 
system performance. (The data that included critical 
failures does not follow the previous pattern; again, the 
small sample size and relatively few critical failures in the 
low capacity case in this scenario account for the 
difference.) 


Table 6: Markov Parameters with valve fault 

                                   Bed Capacity 
Measure                            high      med       low       high*     med*      low* 

α (failure probability)            0.0033    0.0018    0.0033    0.0029    0.0018    0.0033 
β (recovery probability)           0.3914    0.1803    0.2057    0.0195    0.0677    0.0297 
β/α (resilience)                   139.91    100.84    62.280    6.8196    37.484    8.9422 

* includes data from runs with critical failures 


To fully test the usefulness of the proposed 
measure, additional data from the one fault considered 
here and data from a wider variety of faults are needed 
to test the predictive capacity of the measure in 
different scenarios. 


11 More data on the way! Because of numerical considerations, the runs 
with faults and performance failures take significantly more time to 
simulate. 


CONCLUSION 

This paper argues that crew safety depends on the 
resilience of the ALS system: its ability to achieve crucial 
performance goals in the presence of unanticipated 
faults. Static measures of system quality are unlikely to 
capture this important characteristic. We propose a 
method for analyzing system resilience early in the 
design process, rather than waiting until detailed design 
information is available. We develop a computational 
model of a WRS that serves as a testbed for different 
measures of resilience, and propose a novel measure 
that uses the parameters of a Markov chain 
representation of system performance to measure 
resilience. The measure performs well in the (very) 
preliminary analysis of off-nominal system behavior. 
Table 7: Astronaut demand for water use and contaminant produced 12 

event        events     mean      range            flow rate      % cont   total use    cont 
             per day    length                                             (lit/day)    (lit/day) 

shower       6          6 min     3 - 9 min        48.0 lit/hr    0.5      28.80        0.1440 
hand/face    6          3 min     1.5 - 4.5 min    54.4 lit/hr    2.0      19.32        0.3864 
teeth        12         2 min     1 - 3 min        10.8 lit/hr    7.0      4.32         0.3024 
drinking     24         1 min     0.5 - 1.5 min    44.4 lit/hr    0.0      17.76        0.0000 
urine        24         1 min     0.5 - 1.5 min    44.4 lit/hr    20.0     17.76        3.5440 


Table 8: Tank, valve, sensor and filtration bed parameters 

element                    size       type       max flow          controlled by 

tanks 
WWT                        200 lit                                 inflow determined by outflow of dirty water from 
                                                                   astronauts and draining of dirty water from PWTs; 
PWT                        50 lit                                  inflow determined by outflow from WWT; 
                                                                   tank filling chosen by control system, ‘untested’ 
                                                                   tank with highest level filled first; 

valves 
WWT outflow                           variable   5 lit/hr          valve setting is proportional to tank level, 
                                                                   nominal outflow = 3, higher if level > 50%, 
                                                                   max = 5 when level > 90%, min = 2; 
filtration bed inflow      —          switch     —                 beds switched every 5 hours 
PWT outflow to             —          variable   based on          astronaut demand 
astronauts                                       demand 
PWT outflow to             —          switch     —                 tank draining chosen by control system, ‘clean’ 
astronauts                                                         tank with lowest level drained first 
PWT outflow to WWT         —          variable   50 lit/hr         ‘dirty’ tanks drain to WWT at the max flow rate 
                                                                   immediately after sensors detect contamination 

sensors 
Attached to each PWT; ‘test status’ consists of untested/being tested, clean, 
dirty/contaminated. An empty or filling tank is ‘untested’. Tests take one hour. 

filtration beds 
Capacity after regeneration is uniformly distributed over the ranges: 
[1, 2] (low capacity); [1.05, 2.05] (medium capacity); [1.1, 2.1] (high capacity). 
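
The control rules in Table 8 can be written down compactly. The sketch below is an illustration only. The table fixes the WWT outflow at a nominal 3 lit/hr, a maximum of 5 lit/hr above 90% fill and a minimum of 2 lit/hr; the linear interpolation between those points, the representation of a tank as a (level, status) pair, and all function names are assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

Tank = Tuple[float, str]  # (level in liters, test status); an assumed representation

def wwt_outflow_setting(level_fraction: float) -> float:
    """WWT outflow in lit/hr as a function of fill level (0..1).

    Fixed points from Table 8: min = 2, nominal = 3, max = 5 above 90% fill.
    Assumption: linear interpolation between those points.
    """
    level = min(max(level_fraction, 0.0), 1.0)
    if level > 0.9:
        return 5.0
    if level > 0.5:
        return 3.0 + (level - 0.5) * (5.0 - 3.0) / (0.9 - 0.5)
    return 2.0 + level * (3.0 - 2.0) / 0.5

def tank_to_fill(tanks: Sequence[Tank]) -> Optional[int]:
    """Index of the 'untested' tank with the highest level, or None."""
    candidates = [i for i, (_, status) in enumerate(tanks) if status == "untested"]
    return max(candidates, key=lambda i: tanks[i][0], default=None)

def tank_to_drain(tanks: Sequence[Tank]) -> Optional[int]:
    """Index of the 'clean' tank with the lowest level, or None."""
    candidates = [i for i, (_, status) in enumerate(tanks) if status == "clean"]
    return min(candidates, key=lambda i: tanks[i][0], default=None)

if __name__ == "__main__":
    pwts = [(10.0, "clean"), (30.0, "untested"), (5.0, "untested"), (40.0, "clean")]
    print("fill tank", tank_to_fill(pwts), "and drain tank", tank_to_drain(pwts))
    print("WWT outflow at 70% fill:", wwt_outflow_setting(0.7), "lit/hr")
```

None of these rules reacts to a bed breakthrough or a stuck valve, which is consistent with the observation that the control system does not compensate for component failures.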


12 The lengths of the water-using events are uniformly distributed over the given range. Astronaut data is based on 6 astronauts using on 
average a total of 70.2 liters of water per day and producing on average 4.3768 liters of a generic contaminant per day at a rate of 0.1824 liters 
per hour. Note that astronauts do not respire or perspire water. 


































































Table 9: Component Faults, Control Responses and Performance Failures 

(Entries are listed column by column, in table order.) 

system element 

• WWT 
• filtration beds 
• astronauts 

fault 

• outflow valve stuck open 
• outflow valve stuck closed 
• tank leaks 
• complete failure of bed, cannot be regenerated 
• switch valve stuck 
• decline in average bed capacity after regeneration (not directly observable) 
• bed fails to regenerate during one cycle (not observable) 
• inflow switch valve stuck 
• outflow switch valve stuck 
• variable outflow valve stuck 
• sensor failure false positives 
• sensor failure false negatives 
• leave water running, increase water usage, decrease contaminant concentration 
• produce excess contaminant, same water usage, increase contaminant concentration 

possible control system responses 

• drain excess untested water from PWT back to WWT, if necessary, to prevent PWT overflow 
• limit supply of water to astronauts to drinking water, if necessary, to prevent WWT overflow 
• adjust WWT valve control so no water flows while remaining bed regenerates 
• same as above 
• decrease amount of time before bed regeneration 
• interrupt water service while tank fills and tests, set WWT outflow valve to maximum setting, 
  decrease amount of time before bed regeneration to compensate for increased percent 
  contamination in WWT 
• stop filling tank, tank out of service 
• stop filling tank, use switch valve to route water to WWT, tank out of service 
• regenerate beds more frequently 
• increase flow rate over beds 
• decrease time before bed regeneration (response to astronauts producing excess contaminant) 

performance failures 

• decline in clean water reserves or service interruption if valve stuck at low level 
• decline in clean water reserves or service interruption 
• total water in system declines 
• decline in clean water reserves or service interruption 
• same as above 
• decline in clean water reserves or service interruption, increased chance that dirty water 
  in PWT tanks is sent to astronauts 
• same as above, but much less severe 
• service interruption, tank overflows if fault not observed, increased chance that dirty 
  water in PWT tanks is sent to astronauts 
• decline in clean water reserves or service interruption 
• decline in clean water reserves or service interruption, increased chance that dirty water 
  in PWT tanks is sent to astronauts 
• increased chance that dirty water in PWT tanks is sent to astronauts 
• decline in clean water reserves or service interruption 
• decline in clean water reserves or service interruption 
• decline in clean water reserves or service interruption, increased chance that dirty water 
  is sent to astronauts 




































Figure 14: Estimates of β plotted versus estimates of α for nominal operation without induced faults; 
red = low bed capacity, blue = medium bed capacity, green = high bed capacity; 
diamonds = individual runs, stars = all runs except critical failures, plus signs = all runs 



The blue star is to the right of the green star, which indicates that the medium bed is more likely 
to leave the undesired state and return to full water service. This anomaly results from the exclusion of the 
two critical failures from the medium bed capacity data. Note the difference in position between the blue star, 
which shows the aggregated estimate for all runs with medium bed capacity except for those with critical 
failures, and the blue plus sign, which shows the aggregated estimate for all runs with medium bed capacity. 
The same holds for the red star and the red plus sign, which show the low bed capacity case. 




